A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available: current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. The dataset is freely available at http://www.scan-net.org.